219 research outputs found

    STEM: a tool for the analysis of short time series gene expression data

    BACKGROUND: Time series microarray experiments are widely used to study dynamical biological processes. Due to the cost of microarray experiments, and in some cases the limited availability of biological material, about 80% of microarray time series experiments are short (3–8 time points). Previously, short time series gene expression data have mainly been analyzed using more general gene expression analysis tools that were not designed for the unique challenges and opportunities inherent in such data. RESULTS: We introduce the Short Time-series Expression Miner (STEM), the first software program specifically designed for the analysis of short time series microarray gene expression data. STEM implements unique methods to cluster, compare, and visualize such data. STEM also supports efficient and statistically rigorous biological interpretations of short time series data through its integration with the Gene Ontology. CONCLUSION: The unique algorithms STEM implements to cluster and compare short time series gene expression data, combined with its visualization capabilities and integration with the Gene Ontology, should make STEM useful in the analysis of data from a significant portion of all microarray studies. STEM is available for download for free to academic and non-profit users at

    Inferring interactions, expression programs and regulatory networks from high throughput biological data

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003, by Ziv Bar-Joseph. Includes bibliographical references (leaves 171-180). In this thesis I present algorithms for analyzing high throughput biological datasets. These algorithms work on a number of different analysis levels to infer interactions between genes, determine gene expression programs and model complex biological networks. Recent advances in high-throughput experimental methods in molecular biology hold great promise. DNA microarray technologies enable researchers to measure the expression levels of thousands of genes simultaneously. Time series expression data offers particularly rich opportunities for understanding the dynamics of biological processes. In addition to measuring expression data, microarrays have been recently exploited to measure genome-wide protein-DNA binding events. While these types of data are revolutionizing biology, they also present many computational challenges. Principled computational methods are required in order to make full use of each of these datasets, and to combine them to infer interactions and discover networks for modeling different systems in the cell. The algorithms presented in this thesis address three different analysis levels of high throughput biological data: recovering individual gene values, pattern recognition, and networks. For time series expression data, I present algorithms that permit the principled estimation of unobserved time-points, alignment and the identification of differentially expressed genes. For pattern recognition, I present algorithms for clustering continuous data, and for ordering the leaves of a clustering tree to infer expression programs. For the networks level I present an algorithm that efficiently combines complementary large-scale expression and protein-DNA binding data to discover co-regulated modules of genes. This algorithm is extended so that it can infer sub-networks for specific systems in the cell. Finally, I present an algorithm which combines some of the above methods to automatically infer a dynamic sub-network for the cell cycle system.

    Biological interaction networks are conserved at the module level

    Background: Orthologous genes are highly conserved between closely related species and biological systems often utilize the same genes across different organisms. However, while sequence similarity often implies functional similarity, interaction data is not well conserved even for proteins with high sequence similarity. Several recent studies comparing high throughput data including expression, protein-protein, protein-DNA, and genetic interactions between close species show conservation at a much lower rate than expected. Results: In this work we collected comprehensive high-throughput interaction datasets for four model organisms (S. cerevisiae, S. pombe, C. elegans, and D. melanogaster) and carried out systematic analyses in order to explain the apparent lower conservation of interaction data when compared to the conservation of sequence data. We first showed that several previously proposed hypotheses only provide a limited explanation for such lower conservation rates. We combined all interaction evidences into an integrated network for each species and identified functional modules from these integrated networks. We then demonstrate that interactions that are part of functional modules are conserved at much higher rates than previous reports in the literature, while interactions that connect between distinct functional modules are conserved at lower rates. Conclusions: We show that conservation is maintained between species, but mainly at the module level. Our results indicate that interactions within modules are much more likely to be conserved than interactions between proteins in different modules. This provides a network based explanation to the observed conservation rates that can also help explain why so many biological processes are well conserved despite the lower levels of conservation for the interactions of proteins participating in these processes. Accompanying website: http://www.sb.cs.cmu.edu/CrossSP

    A mixture of feature experts approach for protein-protein interaction prediction

    High-throughput methods can directly detect the set of interacting proteins in yeast but the results are often incomplete and exhibit high false positive and false negative rates. A number of researchers have recently presented methods for integrating direct and indirect data for predicting interactions. However, due to missing data and the high redundancy among the features used, different samples may benefit from different features based on the set of attributes available. In addition, in many cases it is hard to directly determine which of the datasets led to the prediction, which is an important issue for the biologists using these predictions to design new experiments. To address these challenges we use a Mixture-of-Experts method. We split the data into four (roughly) homogeneous sets. The individual experts use logistic regression and their scores are combined using another logistic regression. However, when combining the scores, the weighting of each expert depends on the set of input attributes. Thus different experts will have different influence on the prediction depending on the available features. We applied our method to predict the set of interacting proteins in yeast. Our method improved upon the best previous methods for this task. In addition, using the weighting of the experts, the prediction can be easily evaluated by biologists based on the features that they feel are the most reliable.
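
The gating idea in this abstract can be illustrated with a minimal sketch of the prediction step only. It assumes one logistic-regression expert per feature group and a softmax gate driven by a 0/1 mask of which attributes are observed; the softmax gate is a common mixture-of-experts form and may differ from the combiner used in the paper. The function name `predict_interaction`, the feature grouping, and all weights are hypothetical and chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function used by each expert."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Numerically stable softmax used by the gate."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def predict_interaction(x, observed, groups, expert_w, expert_b, gate_w):
    """Combine expert scores with weights that depend on feature availability.

    x        : feature vector, with missing entries set to 0 (assumption)
    observed : 0/1 mask marking which features are available
    groups   : list of index arrays, one feature group per expert
    """
    # Each expert is a logistic regression restricted to its own feature group.
    scores = np.array([
        sigmoid(x[g] @ expert_w[k] + expert_b[k])
        for k, g in enumerate(groups)
    ])
    # The gate sees only the availability mask, so an expert's weight
    # changes with the set of attributes present for this sample.
    gate = softmax(gate_w @ observed)
    return float(gate @ scores), gate

# Toy setup: 8 features split into 4 groups of 2 (all values hypothetical).
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5]), np.array([6, 7])]
rng = np.random.default_rng(0)
expert_w = [rng.normal(size=2) for _ in groups]
expert_b = np.zeros(4)

# Hand-set gate that favours experts whose features are observed.
gate_w = np.zeros((4, 8))
for k, g in enumerate(groups):
    gate_w[k, g] = 5.0

x = rng.normal(size=8)

# All features observed: every expert gets the same weight.
full_mask = np.ones(8)
p_full, gate_full = predict_interaction(x, full_mask, groups, expert_w, expert_b, gate_w)

# Only the first group observed: the first expert dominates.
partial_mask = np.zeros(8)
partial_mask[groups[0]] = 1.0
p_part, gate_part = predict_interaction(x * partial_mask, partial_mask, groups,
                                        expert_w, expert_b, gate_w)
```

Inspecting the returned `gate` vector shows which expert drove a given prediction, which is the kind of interpretability the abstract attributes to the expert weighting.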

    Predicting tissue specific transcription factor binding sites


    Evolutionary divergence in the fungal response to fluconazole revealed by soft clustering

    Background: Fungal infections are an emerging health risk, especially those involving yeast that are resistant to antifungal agents. To understand the range of mechanisms by which yeasts can respond to anti-fungals, we compared gene expression patterns across three evolutionarily distant species (Saccharomyces cerevisiae, Candida glabrata and Kluyveromyces lactis) over time following fluconazole exposure. Results: Conserved and diverged expression patterns were identified using a novel soft clustering algorithm that concurrently clusters data from all species while incorporating sequence orthology. The analysis suggests complementary strategies for coping with ergosterol depletion by azoles: Saccharomyces imports exogenous ergosterol, Candida exports fluconazole, while Kluyveromyces does neither, leading to extreme sensitivity. In support of this hypothesis we find that only Saccharomyces becomes more azole resistant in ergosterol-supplemented media; that this depends on sterol importers Aus1 and Pdr11; and that transgenic expression of sterol importers in Kluyveromyces alleviates its drug sensitivity. Conclusions: We have compared the dynamic transcriptional responses of three diverse yeast species to fluconazole treatment using a novel clustering algorithm. This approach revealed significant divergence among regulatory programs associated with fluconazole sensitivity. In future, such approaches might be used to survey

    Backup in gene regulatory networks explains differences between binding and knockout results

    The complementarity of gene expression and protein–DNA interaction data has led to several successful models of biological systems. However, recent studies in multiple species raise doubts about the relationship between these two datasets. These studies show that the overwhelming majority of genes bound by a particular transcription factor (TF) are not affected when that factor is knocked out. Here, we show that this surprising result can be partially explained by considering the broader cellular context in which TFs operate. Factors whose functions are not backed up by redundant paralogs show a fourfold increase in the agreement between their bound targets and the expression levels of those targets. In addition, we show that incorporating protein interaction networks provides physical explanations for knockout effects. New double knockout experiments support our conclusions. Our results highlight the robustness provided by redundant TFs and indicate that in the context of diverse cellular systems, binding is still largely functional.

    Dynamic Bayesian networks for integrating multi-omics time-series microbiome data

    A key challenge in the analysis of longitudinal microbiome data is to go beyond computing their compositional profiles and infer the complex web of interactions between the various microbial taxa, their genes, and the metabolites they consume and produce. To address this challenge, we developed a computational pipeline that first aligns multi-omics data and then uses dynamic Bayesian networks (DBNs) to integrate them into a unified model. We discuss how our approach handles the different sampling and progression rates between individuals, how we reduce the large number of different entities and parameters in the DBNs, and the construction and use of a validation set to model edges. Applying our method to data collected from Inflammatory Bowel Disease (IBD) patients, we show that it can accurately identify known and novel interactions between various entities and can improve on current methods for learning such interactions. Experimental validations support several predictions about novel metabolite-taxa interactions. The source code is freely available under the MIT Open Source license agreement and can be downloaded from https://github.com/DaniRuizPerez/longitudinal_multiomic_analysis_public